monitor lifecycle conductor#2723
Conversation
Hello benzekrimaha,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
380069a to
25ea9d5
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 6 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.3 #2723 +/- ##
===================================================
+ Coverage 74.48% 74.54% +0.05%
===================================================
Files 200 200
Lines 13603 13690 +87
===================================================
+ Hits 10132 10205 +73
- Misses 3461 3475 +14
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
8316f88 to
408c96c
Compare
408c96c to
e1c5b13
Compare
e1c5b13 to
aefb677
Compare
| const log = this.logger.newRequestLogger(); | ||
| const start = new Date(); | ||
| const start = Date.now(); | ||
| this._scanId = uuid(); |
There was a problem hiding this comment.
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
725c3df to
11a94ea
Compare
a2128cf to
a464b39
Compare
Review by Claude Code |
|
|
LGTM — solid implementation of conductor scan tracking. |
Review by Claude Code |
|
PR #2723: monitor lifecycle conductor — Review Summary This PR adds comprehensive lifecycle conductor scan monitoring: a scan id (UUID v4) propagated through the Kafka message pipeline, new Prometheus metrics (scan end time, bucket count, per-scan message counter, message age histogram), Grafana dashboard panels, and two new alerts (LifecycleStalledScan, LifecycleBucketProcessorMultipleParallelScans). The refactoring of context propagation via Findings:
No security issues, no breaking changes to Kafka message formats (new fields are additive and handled gracefully during rolling upgrades). Review by Claude Code |
|
LGTM — well-structured PR. The scan-context propagation through conductor → bucket-task messages → action entries is thorough and consistent. Metric cleanup timers correctly prevent unbounded prom-client memory growth. Alert PromQL for LifecycleStalledScan handles the edge cases (first scan, scan completion, vector(0) fallback). V1/V2 continuation-entry context is correctly preserved via _makeContinuationEntry with the LifecycleTaskV2 override of _getBucketEntryRequestIdContext. Test coverage is solid across metrics, conductor lifecycle, scan-context propagation, and the BackbeatTestConsumer sentinel handling. |
| nBucketsQueued, | ||
| }); | ||
| if (scanStarted) { | ||
| this._completeCurrentScan(log, totalBucketsListed); |
There was a problem hiding this comment.
On the error-after-start path we still call _completeCurrentScan, which sets latest_batch_end_time and latest_batch_bucket_count. Two consequences:
- A failed scan then looks "completed" to
LifecycleStalledScan(the alert seesend ≥ start), so a scan that started, errored, and wedged won't trip the stalled-scan alert it's meant to catch. bucket_countis published with a misleading partial value.
Resetting _currentScanId/_currentScanStartTimestamp on failure is correct, but should we record the end metrics on the error path at all? Suggest resetting scan state without setting end-time/bucket-count when the scan failed.
| conductorScanId: scanId, | ||
| conductorScanStartTimestamp: start, | ||
| }); | ||
| LifecycleMetrics.onProcessBuckets(log, start); |
There was a problem hiding this comment.
This changes the meaning of an existing metric. s3_lifecycle_latest_batch_start_time used to be set at scan completion (previously onProcessBuckets was called in the success callback); it's now set at scan start.
That's a behavior change for anything already reading this metric — in particular LifecycleLateScan. After this change LateScan means "the conductor hasn't started a scan recently" and no longer catches "a scan started but hung"; that coverage moves entirely to the new LifecycleStalledScan. So the two new StalledScan rules aren't purely additive — they backfill coverage this flip removed from LateScan.
Can we confirm this is intended and call it out explicitly in the PR description / changelog so downstream dashboards and alerts are aware?
| }); | ||
|
|
||
| const bucketProcessorScanMessageAgeSeconds = ZenkoMetrics.createHistogram({ | ||
| name: 's3_lifecycle_bucket_processor_scan_message_age_seconds', |
There was a problem hiding this comment.
The help text says the age is measured "when they finish processing in the bucket processor," but onBucketProcessorScanMessageReceived is called at message pickup (in _processBucketEntry, before fetching the bucket lifecycle config or scheduling the task). So this histogram actually measures "elapsed wall-time since the scan started, sampled at dequeue" — a backlog/lag signal, not processing time. Continuation slices also inherit the original scan-start timestamp, so age keeps growing across a long scan regardless of when a given slice was enqueued.
Either fix the help text to describe what's measured, or move the observation to actual task completion if processing time is what we want.
| }); | ||
|
|
||
| const bucketProcessorScanMessagesProcessed = ZenkoMetrics.createCounter({ | ||
| name: 's3_lifecycle_bucket_processor_scan_messages_processed_total', |
There was a problem hiding this comment.
Naming nit: this counter is incremented at message receipt, before processing and regardless of success or object count (the JSDoc on onBucketProcessorScanMessageReceived says as much), yet it's named ..._scan_messages_processed_total and the method says ...Received. "processed" overstates what it counts. Suggest ..._scan_messages_received_total to match the semantics and the method name.
| assert(parsedMsg.contextInfo?.reqId, 'expected contextInfo.reqId field'); | ||
| parsedMsg.contextInfo.reqId = expectedMsg.value.contextInfo?.reqId; | ||
| expectedValue.contextInfo.reqId = parsedMsg.contextInfo.reqId; | ||
| if (expectedValue.contextInfo?.conductorScanId === 'test-scan-id') { |
There was a problem hiding this comment.
Building on François's earlier point about this utility: this doesn't actually test scan-id propagation. It overwrites the expected conductorScanId/conductorScanStartTimestamp with whatever the actual message contained, and never asserts the fields are present — so a message that omitted them entirely would still pass deepStrictEqual. Unlike the reqId case just above, there's no assert(parsedMsg.contextInfo?.conductorScanId, ...) guarding presence.
Either add a presence assertion (mirroring the reqId check) or move the scan-id assertion back into the test rather than the shared consumer.
| processActionEntry(entry, done) { | ||
| const startTime = Date.now(); | ||
| const log = this.logger.newRequestLogger(); | ||
| const conductorScanId = entry.getContextAttribute('conductorScanId'); |
There was a problem hiding this comment.
Consistency: there are now three different ways scan context gets onto the logger across the action tasks. LifecycleUpdateExpirationTask does log.addDefaultFields(entry.getLogInfo()), while here and in LifecycleUpdateTransitionTask we manually getContextAttribute('conductorScanId') + getContextAttribute('conductorScanStartTimestamp') + addDefaultFields.
Since ActionQueueEntry._loggedAttributes now includes both scan fields, the manual extraction is redundant — these two could use entry.getLogInfo() too (this is the direction François asked for earlier). Standardizing on one approach would be cleaner.
| } | ||
| }, | ||
| "concurrency": 10, | ||
| "scanMetricRetentionS": 86400, |
There was a problem hiding this comment.
The retention default now lives in three places: 86400 here, and DEFAULT_SCAN_METRIC_RETENTION_S = 24 * 60 * 60 duplicated in both LifecycleMetrics.js and LifecycleConfigValidator.js. Suggest a single exported constant (re-used by the validator's .default(...)) so these can't drift apart.
| }); | ||
|
|
||
| describe('_indexesGetOrCreate', () => { | ||
| it('should include conductor scan id in task context', () => { |
There was a problem hiding this comment.
This test exercises _taskToMessage, but it's placed under describe('_indexesGetOrCreate'). Minor, but worth moving it to a block that matches what it covers (or its own describe('_taskToMessage')).
| } | ||
|
|
||
| const ageSeconds = (Date.now() - conductorScanStartTimestamp) / 1000; | ||
| if (ageSeconds >= 0) { |
There was a problem hiding this comment.
The ageSeconds >= 0 guard drops the observation entirely when the computed age is negative, rather than clamping it to 0. Since the age is a cross-host subtraction (Date.now() here minus the conductor's Date.now() carried in the message), small negative values are expected and dropping them silently removes the fastest samples, biasing the histogram upward. Suggest observe(..., Math.max(0, ageSeconds)) so those samples still land in the lowest bucket.
Issue: BB-740